Scalability Analysis and Use of Compression at the Goddard DAAC and End-to-End MODIS Transfers 

Daniel A. Menasce 

During this reporting period, the consultant's work was divided into two tasks: 
1) Scalability Analysis of ECS’s Data Server (DS) and a Cost Analysis of the Use of Compression, 
and 2) Analysis of End-to-End MODIS Transfers. 

Task 1: Scalability Analysis of ECS's Data Server (DS) and a Cost Analysis of the Use of Compression 

This task comprises a scalability analysis of the use of compression at the Data Server of the 
Goddard DAAC, using the SZ (science zip) compression algorithm developed by Pen-Shu Yeh. 
The analysis was carried out using the queuing-network-based model developed by the 
consultant. Three types of compression parameters were obtained to analyze the various 
compression scenarios: compression ratio, defined as the ratio of the compressed file size to the 
original file size; compression time; and decompression time. Compression ratios were 
obtained from real sensor data. Compression and decompression times were obtained from 
both real sensor data and simulated MODIS data. The measured values were averaged in two 
categories: level 1 data and levels 2-4 data. 
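
As a minimal sketch of how the compression ratio defined above could be computed and averaged per category, consider the following. The function names and the sample file sizes are hypothetical and are not taken from the measurements in this report.

```python
from collections import defaultdict


def compression_ratio(original_bytes, compressed_bytes):
    """Ratio of compressed size to original size (smaller means better compression)."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return compressed_bytes / original_bytes


def category_of(level):
    """Map a product level to one of the two averaging categories used in the text."""
    if level == 1:
        return "level 1"
    if 2 <= level <= 4:
        return "levels 2-4"
    raise ValueError("only levels 1 through 4 are averaged")


def average_by_category(samples):
    """Average the ratio per category; samples are (level, original, compressed)."""
    buckets = defaultdict(list)
    for level, original, compressed in samples:
        buckets[category_of(level)].append(compression_ratio(original, compressed))
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}


# Made-up file sizes in bytes, for illustration only.
samples = [(1, 100, 40), (1, 200, 60), (2, 100, 30), (4, 100, 20)]
averages = average_by_category(samples)
print(averages)
```

With this definition a ratio below 1 means the compressed file is smaller than the original.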

Two compression scenarios were analyzed: 

SZDC: use of SZ compression for ingest and distribution of data in compressed form. 

SZUD: use of SZ compression for ingest and uncompression before distribution. Uncompression 
in this case is done by the data distribution process. 

Figure 1 shows the results for the SZDC case for two retrieval mixes: 20% large files (20L) 
and 80% large files (80L). The figure shows three sets of curves, one for each workload: Ingest 
(ING), Large Retrievals (LR), and Small Retrievals (SR). Each set shows the results for 
the 20L and 80L cases. The first observation is that, for each workload, the 20L case exhibits a 
higher response time than the 80L case. For each file size mix, response time 
is highest for ingest, followed by LR, and then SR. Finally, with SZDC the system can easily 
support a flow of 3400 GB/day. 

Figure 2 shows the corresponding results for the SZUD scenario. The behavior is quite 
different. Response time is now highest for LR, followed by ING, and then SR. The 
results for 20L and 80L are indistinguishable in the graph. Also, the DS saturates at 800 GB/day 
because the data distribution server cannot handle the load of uncompressing 
the large files. 

Since the bottleneck for the SZUD case is the data distribution server, we investigated the effects 
of using data distribution servers with higher capacity. We considered three cases: i) a data 
distribution server with a CPU twice as fast as the original (2x case), ii) a data distribution server 
with two CPUs, each twice as fast as the original CPU (2x 2cpu case), and iii) a data 
distribution server with four CPUs, each twice as fast as the original CPU (4x 
2cpu case). The results are shown in Figure 3. 

As can be seen in Figure 3, the 2x case not only decreases the response time but also extends 
saturation from 800 to 1200 GB/day. The 2x 2cpu case extends saturation to 2000 GB/day. The 
4x 2cpu scenario shows no signs of saturation even at 3400 GB/day. 
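
The sharp rise in response time near each saturation point can be illustrated with a very coarse single-queue approximation, R/S = 1/(1 - U), where U is the offered load divided by the saturation throughput. This is only a sketch and not the consultant's queuing-network model: it treats the whole server as one queue and ignores multi-CPU effects. The saturation values are the ones quoted in the text; the function name is hypothetical.

```python
def stretch_factor(load_gb_day, saturation_gb_day):
    """Response time relative to the unloaded service time, R/S = 1 / (1 - U).

    U is the utilization, approximated here as load / saturation throughput.
    At or beyond saturation the queue grows without bound, so return infinity.
    """
    utilization = load_gb_day / saturation_gb_day
    if utilization >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - utilization)


# Saturation throughputs quoted in the text for the SZUD scenario (GB/day).
SATURATION_GB_DAY = {"original": 800, "2x": 1200, "2x 2cpu": 2000}

for load in (400, 1000, 1600):
    row = {name: stretch_factor(load, cap) for name, cap in SATURATION_GB_DAY.items()}
    print(load, row)
```

Under this approximation a server running at half of its saturation throughput already has twice its unloaded response time, which is why raising the saturation point also lowers response time at a fixed load.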

Figure 2 - SZUD - No Reprocessing 

Figure 3 - Response Time for Large Retrieval for the 80L Case and SZUD for Various Data Distribution Server Scenarios 

A cost analysis of the use of compression at the Goddard DAAC was carried out. The following 
assumptions with respect to compression were made: i) the tape compression ratio is 1.5:1, ii) level 0 
data is compressed by tape compression only, and iii) data at levels 1 and above is compressed by SZ 
but not compressed further by tape compression. The cumulative storage requirements and costs through 
2007 are summarized in Table 1. The use of compression saves about $1.5 million in storage 
costs. If data is distributed in compressed form, the network cost savings amount to $486 
thousand. 

Table 1 - Cumulative Storage Requirements and Storage Costs. 



                                      level 0   level 1   level 2+     Total
Volumes in TB (no compression)            424     1,487        234     2,146
Volumes in TB (with SZ compression)       283       577         67       926
Storage Savings in TB                     141       910        168     1,219
Costs in $K (no SZ compression)         $ 786   $ 2,955      $ 493   $ 4,235
Costs in $K (with SZ compression)       $ 786   $ 1,698      $ 209   $ 2,693
Cost Savings in $K                        $ -   $ 1,257      $ 284   $ 1,542
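
The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below uses only values transcribed from the table; the dictionary keys and function name are hypothetical. Recomputed totals differ from the printed ones by at most 1, presumably because the table entries are rounded.

```python
# Values transcribed from Table 1 (volumes in TB, costs in $K).
TABLE_1 = {
    "level 0":  {"tb_raw": 424,  "tb_sz": 283, "cost_raw": 786,  "cost_sz": 786},
    "level 1":  {"tb_raw": 1487, "tb_sz": 577, "cost_raw": 2955, "cost_sz": 1698},
    "level 2+": {"tb_raw": 234,  "tb_sz": 67,  "cost_raw": 493,  "cost_sz": 209},
}


def savings(row):
    """Return (storage savings in TB, cost savings in $K) for one table column."""
    return row["tb_raw"] - row["tb_sz"], row["cost_raw"] - row["cost_sz"]


total_tb_saved = sum(savings(row)[0] for row in TABLE_1.values())
total_cost_saved_k = sum(savings(row)[1] for row in TABLE_1.values())

# Level 0 is compressed by tape only, at the assumed 1.5:1 ratio.
level0_after_tape = round(TABLE_1["level 0"]["tb_raw"] / 1.5)

print(total_tb_saved, total_cost_saved_k, level0_after_tape)
```

The recomputed cost savings of about $1,541K agree with the roughly $1.5 million quoted in the text.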


If compression is implemented at the Science Processor instead of the Data Server, one additional 
CPU has to be added to the current configuration, assuming that CPUs become faster according to 
Moore’s Law. In the worst-case scenario, where CPU speed remains constant over the years of the 
project, four additional CPUs will be needed to handle the extra compression/decompression 
load. This additional expense is negligible compared with the cost savings from using 
compression. 


Task 2: End-to-End MODIS Transfer 

The goal of this task is to analyze the performance of single and multiple FTP transfers between 
SCFs and the Goddard DAAC. We developed an analytic model to compute the performance of 
FTP sessions as a function of various key parameters, implemented the model as a program called 
FTP Analyzer, and carried out validations with real data obtained by running single and multiple 
FTP transfers between GSFC and the Miami SCF. 

The input parameters to the model include the mix of FTP sessions (the scenario) and, for each FTP 
session, the file size. The network parameters include the round trip time, the packet loss rate, the 
limiting bandwidth of the network connecting the SCF to a DAAC, TCP’s basic timeout, TCP’s 
Maximum Segment Size, and TCP’s Maximum Receiver’s Window Size. 

The modeling approach consisted of modeling TCP’s overall throughput, computing TCP’s 
delay per FTP transfer, and then solving a queuing network model that includes the FTP clients 
and servers. 
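
The first two steps can be sketched as follows. The report does not state which throughput formula FTP Analyzer uses, so this sketch substitutes the well-known steady-state approximation of Padhye, Firoiu, Towsley, and Kurose, which takes the same kinds of inputs listed above (round trip time, loss rate, basic timeout, Maximum Segment Size, and maximum receiver window). The function names, parameter names, and sample values are assumptions made for illustration.

```python
import math


def tcp_throughput_bps(rtt_s, loss_rate, mss_bytes, max_window_bytes,
                       timeout_s, link_bps, packets_per_ack=2):
    """Approximate steady-state TCP throughput in bits per second.

    Stand-in formula (Padhye et al. approximation), not necessarily the one
    used by FTP Analyzer. The result is capped by the receiver window and by
    the limiting bandwidth of the path.
    """
    window_limit_pps = (max_window_bytes / mss_bytes) / rtt_s
    if loss_rate <= 0:
        pps = window_limit_pps  # no loss: only the window limits the rate
    else:
        p, b = loss_rate, packets_per_ack
        denom = (rtt_s * math.sqrt(2 * b * p / 3)
                 + timeout_s * min(1.0, 3 * math.sqrt(3 * b * p / 8))
                 * p * (1 + 32 * p ** 2))
        pps = min(window_limit_pps, 1.0 / denom)
    return min(pps * mss_bytes * 8, link_bps)


def transfer_time_s(file_bytes, throughput_bps):
    """Delay of one FTP transfer, ignoring connection setup and slow start."""
    return file_bytes * 8 / throughput_bps


# Hypothetical path: 100 ms RTT, 1% loss, 1460-byte MSS, 64 KB window, 1 s timeout.
rate = tcp_throughput_bps(0.1, 0.01, 1460, 65535, 1.0, 1e9)
print(rate, transfer_time_s(100 * 1024 * 1024, rate))
```

The per-transfer delay computed this way would then serve as an input to the queuing network model of FTP clients and servers, which is not sketched here.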